
SYCL: reduce allocation overhead during flash attention#22732

Merged
ggerganov merged 6 commits into ggml-org:master from sanmai:fa-overhead-sycl
May 9, 2026

Conversation

@sanmai
Contributor

@sanmai sanmai commented May 5, 2026

Fixes #22585

Overview

I found that flash attention allocated quite a few K/V buffers with little reuse, and they remained in the legacy pool until teardown. A better strategy is to allocate the FA buffers outside the common pool and grow them on demand, so that FA holds at most the largest buffers it actually needs.

Arguably there are more optimal strategies, since the buffers stay occupied once grown. That said, the legacy pool has no eviction either, so at the very least we end up in a better spot.

  • To reduce the number of allocations, buffers grow in chunks of 16 MiB.
  • The requests grow like this: 0.5, 1.0, 1.5, 2.0, 2.5... but that could be model-dependent.
  • FA completes before ggml_sycl_fattn_alloc::alloc is called again, so no queue sync should be needed, but I added one anyway to be extra safe.
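The grow-on-demand strategy above can be sketched as follows. This is a minimal standalone illustration, not the PR's actual code: `fattn_kv_buffer` and its members are hypothetical names, and plain `new[]` stands in for `sycl::malloc_device` plus the queue synchronization mentioned in the last bullet.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a grow-on-demand buffer: capacity only ever
// increases, rounded up to 16 MiB chunks to limit the allocation count.
struct fattn_kv_buffer {
    static constexpr size_t CHUNK = 16 * 1024 * 1024; // 16 MiB

    uint8_t * data     = nullptr;
    size_t    capacity = 0; // bytes

    // Round a byte count up to the next chunk boundary.
    static size_t round_up(size_t n) {
        return ((n + CHUNK - 1) / CHUNK) * CHUNK;
    }

    // Ensure at least n bytes; reallocate only when the request exceeds
    // the current capacity. The real code would use sycl::malloc_device
    // and sync the queue before freeing the old allocation.
    uint8_t * ensure(size_t n) {
        if (n > capacity) {
            delete[] data;
            capacity = round_up(n);
            data     = new uint8_t[capacity];
        }
        return data;
    }

    ~fattn_kv_buffer() { delete[] data; }
};
```

With this policy, a request sequence like 80, 90, 100 MiB triggers only two allocations (96 MiB, then 112 MiB) instead of three.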

Additional information

Memory benchmarks with B60 and Qwen3.6-35B-A3B UD-Q4_K_M, q4_0:

Baseline (without buffers)

Smaller prompt:

~ggml_sycl_pool_leg: 9 buffers, cached = 181.67 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/50.40/8.40/50.40

Larger prompt:

~ggml_sycl_pool_leg: 23 buffers, cached = 1170.24 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/50.40/85.05/8.40/50.40/53.02/56.17/59.32/62.47/65.62/67.20/69.30/70.88/72.97/74.55/76.65/78.75/80.85/82.95

With buffers

Smaller prompt:

ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

68.80 MiB savings compared to the baseline.

Larger prompt:

ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

That's 272.87 MiB total for the pool plus the two FA buffers, or 897.37 MiB savings compared to the baseline. In a memory-constrained environment that could be a deal breaker.

The change shows up in the common memory breakdown just as one would expect.

 | memory breakdown [MiB]                        | total   free     self   model   context   compute    unaccounted |
-|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =  854 + (20003 = 18727 +     782 +     493) +        2398 |
+|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 = 1129 + (20643 = 18727 +    1422 +     493) +        1483 |
 |   - Host                                      |                  2786 =  2522 +       0 +     264                |

Discussion

From what I can tell, other backends, such as CUDA, do not have this problem. Specifically, CUDA uses a VMM pool. I briefly considered adding ggml_sycl_pool_vmm as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.
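The granularity issue can be illustrated with simple arithmetic. VMM-style reservations must be multiples of the device's allocation granularity (queried via `cuMemGetAllocationGranularity` in the CUDA driver API; finding the SYCL equivalent is part of the problem). The sketch below is hypothetical and assumes a common 2 MiB granule; the point is that many small pool slots each burn a whole granule.

```cpp
#include <cstddef>

// Assumed granule size for illustration; the real value must be queried
// from the driver and varies by device.
constexpr size_t GRANULARITY = 2 * 1024 * 1024; // 2 MiB

// Every VMM mapping is rounded up to a whole number of granules.
size_t reserve_size(size_t requested) {
    return ((requested + GRANULARITY - 1) / GRANULARITY) * GRANULARITY;
}
```

For example, the tiny 0.01 MiB slot in the baseline pool dump above would still consume a full 2 MiB granule under a naive VMM mapping.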

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES; I used them in assistive capacity for profiling, prototyping, code review.

@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels May 5, 2026
Comment thread ggml/src/ggml-sycl/common.hpp Outdated

sycl::half * alloc(size_t n_elems) {
ptr = buf.ensure_half(n_elems);
return ptr;
Contributor Author

@sanmai sanmai May 5, 2026


The calling code does not use the return value, but pool_alloc::alloc returns one, so I keep doing that here too, to reduce the surprise if someone later changes the old code to use the return value.

-ggml_sycl_pool_alloc<sycl::half> K_f16(pool);
-ggml_sycl_pool_alloc<sycl::half> V_f16(pool);
+ggml_sycl_fattn_alloc K_f16(fbuf.K);
+ggml_sycl_fattn_alloc V_f16(fbuf.V);
Contributor Author

@sanmai sanmai May 5, 2026


I considered adding a no-op template to make it look more uniform:

ggml_sycl_fattn_alloc<sycl::half>   K_f16(fbuf.K);
ggml_sycl_fattn_alloc<sycl::half>   V_f16(fbuf.V);
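Such a no-op template could look like the sketch below. This is hypothetical, not the merged code: `fattn_buffer` and `ensure_bytes` are stand-in names, and the parameter `T` exists only so call sites line up visually with `ggml_sycl_pool_alloc<T>`.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the real FA buffer (grow-on-demand, host
// malloc instead of device memory for the sake of a runnable sketch).
struct fattn_buffer {
    void * data = nullptr;
    size_t cap  = 0;

    void * ensure_bytes(size_t n) {
        if (n > cap) {
            std::free(data);
            data = std::malloc(n);
            cap  = n;
        }
        return data;
    }

    ~fattn_buffer() { std::free(data); }
};

// T is a no-op: it only affects the element type of the returned
// pointer, mirroring the ggml_sycl_pool_alloc<T> interface.
template <typename T>
struct ggml_sycl_fattn_alloc {
    fattn_buffer & buf;
    T *            ptr = nullptr;

    explicit ggml_sycl_fattn_alloc(fattn_buffer & b) : buf(b) {}

    T * alloc(size_t n_elems) {
        ptr = static_cast<T *>(buf.ensure_bytes(n_elems * sizeof(T)));
        return ptr;
    }
};
```

The trade-off is purely cosmetic: the template adds no behavior, it just keeps the two call sites visually symmetric with the pool-based allocator they replace.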

@sanmai sanmai marked this pull request as ready for review May 5, 2026 23:05
@sanmai sanmai requested a review from a team as a code owner May 5, 2026 23:05
Contributor

@arthw arthw left a comment


Good job!
I tested it and the memory usage is reduced. It's great for users.

comments:

  1. DEBUG_SYCL_POOL
    Please update the description in SYCL.md.

  2. The feature is about flash attention, so please move the code into fattn-common.hpp or fattn-xxx.cpp/hpp. Flash attention is a big feature; I suggest moving all of its code into the fattn-xxx files.

common.hpp is for more common code, and ggml-sycl.cpp is for the base API of the SYCL backend.

Thank you!

@sanmai
Contributor Author

sanmai commented May 8, 2026

I appreciate the review! Should be ready for another round.

@sanmai sanmai requested a review from arthw May 8, 2026 06:58
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 8, 2026
Comment thread ggml/src/ggml-sycl/fattn-buffers.cpp
Contributor

@arthw arthw left a comment


Good job!

Thank you!

@arthw arthw added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 9, 2026
@ggerganov
Member

From what I can tell, other backends, such as CUDA, do not have this problem. Specifically, CUDA uses a VMM pool. I briefly considered adding ggml_sycl_pool_vmm as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.

Would the VMM solution be re-considered in the future? Maybe add TODOs?

@sanmai
Contributor Author

sanmai commented May 9, 2026

VMM is next once this goes through (the guidance says at most 1 PR in flight for new contributors). That said, this approach still decreases the unaccounted memory usage even together with VMM, so they aren't exclusive.

- without vmm pool:
+ with vmm:
 common_memory_breakdown_print: | memory breakdown [MiB]                        | total   free     self   model   context    compute    unaccounted |
-common_memory_breakdown_print: |   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =   56 + (21100 = 16104 +    4501 +     495) +        2098 |
+common_memory_breakdown_print: |   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =  223 + (21100 = 16104 +    4501 +     495) +        1931 |
 common_memory_breakdown_print: |   - Host                                      |                   958 =   682 +       0 +     276                |

So VMM improves things (1931 vs 2098 without) but not as much as the current approach (1483 vs 2398).

And there are many other uses for VMM.

@ggerganov ggerganov merged commit e20b839 into ggml-org:master May 9, 2026
49 checks passed
@sanmai sanmai deleted the fa-overhead-sycl branch May 9, 2026 06:30

Labels

documentation: Improvements or additions to documentation
ggml: changes relating to the ggml tensor library for machine learning
merge ready: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language


Development

Successfully merging this pull request may close these issues.

SYCL: flash-attention buffers are retained across long-context ubatches causing linear VRAM growth

3 participants